Welcome to Intern Insight, Data Tech College’s Dedicated Student Internship Admissions Portal!
The goal:
Interact with the data for an enhanced user experience.
Provide effective insights into the internship activity of Data Tech College students each year.
Foster data-driven action to help advance the careers of the students.
About the Data
The dataset contains summer internship results for 80 students who attend Data Tech College. Features within the data include academic attributes such as the student’s test score, GPA, and writing score. Holistic features such as volunteer and work experience are also included, along with demographic information such as state and gender.
The original internship admissions data set contained outliers such as erroneous GPA and demographic values, which were subsequently removed during data preprocessing. All visualizations and results presented below are based on the cleaned data set, which excludes the outlier rows.
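As a rough illustration of the kind of preprocessing described above, the sketch below drops rows that fail simple validity checks. The rows, the 0.0 to 4.0 GPA range, the set of accepted gender values, and the `Gender` column name are all assumptions made for this example; the report does not state the actual cleaning rules.

```python
# Hypothetical raw rows; values and the "Gender" column name are illustrative only.
raw_rows = [
    {"GPA": 3.7, "Gender": "Female", "State": "California"},
    {"GPA": 6.2, "Gender": "Male", "State": "Florida"},   # GPA outside the assumed range
    {"GPA": 3.1, "Gender": "", "State": "Colorado"},      # missing demographic value
    {"GPA": 2.9, "Gender": "Male", "State": "Oregon"},
]

# Assumed validity rules (not taken from the report).
GPA_MIN, GPA_MAX = 0.0, 4.0
VALID_GENDERS = {"Female", "Male"}


def is_valid(row):
    """Return True when a row passes every assumed validity rule."""
    return GPA_MIN <= row["GPA"] <= GPA_MAX and row["Gender"] in VALID_GENDERS


# Keep only the rows that pass the checks.
clean_rows = [row for row in raw_rows if is_valid(row)]
print(len(raw_rows), "raw rows ->", len(clean_rows), "clean rows")
```

Keeping the rules in one small function makes it easy to report how many rows each cleaning pass removes.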
Further Clarifications:
Students are applying strictly to data science internships around the United States.
Each student applied to one internship for the summer of 2023, and each student heard back on whether they were accepted, waitlisted, or declined.
The state represents the state in which the internship applied to is located.
Test scores refer to LeetCode test scores.
At A Glance - Summer 2023
Median Trends by Metric
Below is an interactive bar plot displaying the median of each numerical metric (GPA, Test Score, etc.) by internship admissions decision. To switch between metrics, click on the drop down and select your metric of choice.
Note: The median was chosen as opposed to the mean due to the skew apparent in most of the features of the data set.
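To see why skew favors the median, here is a minimal sketch using made-up work-experience values (not the real data). A single large value pulls the mean upward while the median stays at the typical value.

```python
from statistics import mean, median

# Hypothetical right-skewed work-experience values in years; not the real data.
work_experience = [0, 0, 1, 1, 1, 2, 2, 3, 12]

avg = mean(work_experience)
mid = median(work_experience)

# One very experienced student inflates the mean; the median is unaffected.
print(f"mean = {avg:.2f}, median = {mid}")
```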
Code
# Import packages
import pandas as pd
import numpy as np
import altair as alt
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
from mpl_toolkits.mplot3d import Axes3D
from vega_datasets import data

# Read in clean data
df = pd.read_csv("../data/clean_data.csv")

# Get mean and median dfs
means = df[["Decision", "GPA", "WorkExp", "TestScore", "WritingScore", "VolunteerLevel"]].groupby("Decision").agg(["mean"]).reset_index()
means.columns = means.columns.droplevel(1)
means.columns = ["Decision", "GPA", "Work Experience", "Test Score", "Writing Score", "Volunteer Level"]

medians = df[["Decision", "GPA", "WorkExp", "TestScore", "WritingScore", "VolunteerLevel"]].groupby("Decision").agg(["median"]).reset_index()
medians.columns = medians.columns.droplevel(1)
medians.columns = ["Decision", "GPA", "Work Experience", "Test Score", "Writing Score", "Volunteer Level"]

# Plotly visual
variables = ["GPA", "Test Score", "Writing Score", "Work Experience", "Volunteer Level"]
order = ["Admit", "Waitlist", "Decline"]
pio.renderers.default = "plotly_mimetype+notebook"

# Add traces
plot = go.Figure(data=[
    go.Bar(
        name="GPA",
        x=medians["Decision"],
        y=medians["GPA"],
        marker_color="#0E6BA8"
    ),
    go.Bar(
        name="Test Score",
        x=medians["Decision"],
        y=medians["Test Score"],
        marker_color="#6F0624",
        visible=False
    ),
    go.Bar(
        name="Writing Score",
        x=medians["Decision"],
        y=medians["Writing Score"],
        marker_color="#8B748F",
        visible=False
    ),
    go.Bar(
        name="Work Experience",
        x=medians["Decision"],
        y=medians["Work Experience"],
        marker_color="#00072D",
        visible=False
    ),
    go.Bar(
        name="Volunteer Level",
        x=medians["Decision"],
        y=medians["Volunteer Level"],
        marker_color="#0A2472",
        visible=False
    )
])

# Set the initial view to JUST be GPA
initial_view = {"visible": [True, False, False, False, False]}

# List of titles to use
titles = ["Median GPA", "Median Test Score", "Median Writing Score", "Median Years of Work Experience", "Median Volunteer Level"]

# Dropdown
plot.update_layout(
    updatemenus=[
        dict(
            active=0,
            x=-0.1,
            y=0.7,
            buttons=list([
                dict(
                    label=variable,
                    method="update",
                    args=[
                        {"visible": [i == j for i in range(len(variables))]},
                        {
                            "title": f"{titles[j]} by Admissions Decision",
                            "xaxis_title": "Admissions Decision",
                            "yaxis_title": titles[j],
                            "xaxis": {"categoryorder": "array", "categoryarray": order}
                        }
                    ]
                )
                for j, variable in enumerate(variables)
            ]),
        )
    ],
    title_text=f"{titles[0]} by Admissions Decision",
    xaxis=dict(categoryorder="array", categoryarray=order),
    showlegend=True,
    margin=dict(l=50, r=50, t=50, b=50)
)

plot.show()
Demographic Insights
The following section breaks down the relationship between demographics (gender & state) and internship application decisions.
Geographics
To provide an overview of the data, we will be looking at the data from a geographic perspective, specifically at the state level.
Below is an interactive map of students per state and the outcome of their internship applications. Note that for most states and decisions, only a handful of students were admitted, waitlisted, or declined. To see the breakdown of outcomes by state, hover over the desired state.
Florida has the most internship applications, followed by Colorado and California. In California, 66% of the students who applied to internships were admitted, and in Colorado about 50% were admitted. However, in Florida, where the most students applied, only about 30% were admitted. This indicates that Data Tech College students are very interested in interning in Florida but are often unsuccessful in being admitted there.
One recommendation for Data Tech College is to conduct more research into the industries in which students want to intern, and to create more courses that support those interests. If 66% of students who apply to internships in California get admitted, it could indicate that the courses currently offered at Data Tech College align with the knowledge expected in industries that are predominant in California, such as climate change data science. However, if students are not getting admitted to internships in Florida, where biotech, for example, is a sought-after industry, more courses that reflect students’ interests could be created to better prepare them for their desired internships.
Above is a choropleth map of the average of each numeric feature (GPA, test score, writing score, work experience in years, and volunteer level) by state. Each average is calculated across all decision types to obtain a holistic view of the student data by state. Below we summarize some findings for each feature:
GPA
California has the highest average GPA, with Florida and New York close behind.
Oregon and Mississippi have the lowest average GPA.
Test Score
California has the highest average test score.
Mississippi has the lowest average test score.
Writing Score
California has the highest average writing score.
New York has the lowest average writing score.
Work Experience
Mississippi has the highest average work experience in years.
Oregon has the lowest average work experience.
Volunteer Level
Oregon has the highest average volunteer level.
Alabama has the lowest average volunteer level.
California has the highest average academic features, specifically GPA, test score, and writing score. This indicates that Data Tech College students are academically well prepared for internships in California.
We can also look at some of these features at the geographic level by decision.
As we can see from the average GPA and test scores for admitted and declined students by state, students who were admitted had higher GPAs and test scores than those who were declined.
One notable example for both GPA and test score is Florida. About 30% of students who applied to internships in Florida were admitted, and those students had high average GPAs and test scores. However, the students who were declined also had a high average GPA of 3.5, which is considered a good GPA. Students who applied to Florida internships but were declined also had the highest average test score among all declined students.
The finding from Florida confirms the need to expand research into students’ desired industries and to provide courses that support industry knowledge. Florida’s example also shows how test score could play a more important role in admission outcomes than GPA.
This insight can help Data Tech College to improve students’ test scores so as to increase their chances of being admitted to an internship.
Decision Rates by State
We can also look at the rates of students admitted to and declined from internships by state to see how successful students are overall in the selected states.
Code
# Create dataframe of rates for each state by decision
decision_state = df.groupby(['Decision', 'State'])[["GPA"]].count().reset_index()
decision_state = decision_state.rename(columns={'GPA': 'StateCount'})
decision_state['DecisionCount'] = decision_state.groupby('Decision')['StateCount'].transform('sum')
decision_state['Rate'] = decision_state['StateCount'] / decision_state['DecisionCount'] * 100

# Map state names to the numeric state IDs in the vega_datasets table
state_data = data.population_engineers_hurricanes()
state_id_dict = dict(zip(state_data["state"], state_data["id"]))
decision_state["StateID"] = decision_state["State"].map(state_id_dict)

admit_states = decision_state[decision_state['Decision'] == "Admit"]
decline_states = decision_state[decision_state['Decision'] == "Decline"]
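Above are maps of the rates of students admitted and declined by state. As computed in the code above, a state’s rate is its share of all students who received that decision (for example, the percentage of all admitted students whose internship was in that state). Some findings from the maps are: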
Florida had the highest rate of admitted students.
Utah had the lowest rate of admitted students.
Florida also has the highest rate of rejected students.
California, Oregon, and Mississippi all have the lowest rate of rejected students.
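The rate mapped here (a state's share of all students with a given decision) differs from the within-state admission percentages quoted in the Geographics section. The sketch below contrasts the two on made-up (state, decision) pairs; the counts and the helper names `share_of_decision` and `rate_within_state` are hypothetical and exist only for this illustration.

```python
from collections import Counter

# Hypothetical (state, decision) pairs; not the real data.
applications = [
    ("Florida", "Admit"), ("Florida", "Admit"), ("Florida", "Decline"),
    ("Florida", "Decline"), ("Florida", "Decline"), ("Florida", "Waitlist"),
    ("California", "Admit"), ("California", "Admit"), ("California", "Decline"),
]

pair_counts = Counter(applications)
state_totals = Counter(state for state, _ in applications)
decision_totals = Counter(decision for _, decision in applications)


def share_of_decision(state, decision):
    """Percent of all students with `decision` whose internship is in `state`."""
    return pair_counts[(state, decision)] / decision_totals[decision] * 100


def rate_within_state(state, decision):
    """Percent of the students applying in `state` who received `decision`."""
    return pair_counts[(state, decision)] / state_totals[state] * 100


for state in ("Florida", "California"):
    print(
        state,
        f"share of admits = {share_of_decision(state, 'Admit'):.1f}%",
        f"admit rate within state = {rate_within_state(state, 'Admit'):.1f}%",
    )
```

Because a state with many applicants contributes heavily to every decision total, it can have a large share of both admits and declines even when its within-state admit rate is low.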
There is no clear relationship between admissions and rejections by state. Data Tech College, however, should try to expand the domains covered in the curriculum to incorporate various industries that have hubs across the country.
Gender
It is important to establish that internship opportunities are given fairly and equitably to all students regardless of gender. Analyzing decisions by gender can highlight any discrepancies or biases in the selection process, which is the primary focus of the following section.
The heatmap above displays internship decision counts for females and males. Because the colors represent counts in very similar ranges, gender does not appear to be a contributing factor in internship decisions. However, it’s important to back that statement with statistics, such as a Chi-Square test. Because the data set is relatively small, the chi-square approximation may not be accurate, so an exact test, Fisher’s Exact Test, was also performed to verify the result of the Chi-Square test.
The Chi-Square statistic measures the difference between the gender frequencies observed in each decision category and the frequencies that would be expected if there were no association between the variables. A low chi-square value, as seen above, indicates that the observed frequencies are very close to the expected frequencies. The large p-value (greater than 0.05) indicates that there is no statistically significant association between gender and admission decisions, so this test provides no evidence of bias, which is the desired result.
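The comparison of observed and expected frequencies can be sketched by hand. The counts below are hypothetical, not the report's contingency table; each expected count is the row total times the column total divided by the grand total.

```python
# Hypothetical gender-by-decision counts; not the report's contingency table.
observed = {
    "Female": {"Admit": 10, "Waitlist": 8, "Decline": 12},
    "Male": {"Admit": 11, "Waitlist": 9, "Decline": 10},
}

decisions = ["Admit", "Waitlist", "Decline"]
row_totals = {gender: sum(counts.values()) for gender, counts in observed.items()}
col_totals = {d: sum(observed[gender][d] for gender in observed) for d in decisions}
grand_total = sum(row_totals.values())

chi_square = 0.0
for gender in observed:
    for decision in decisions:
        # Count expected in this cell if gender and decision were independent.
        expected = row_totals[gender] * col_totals[decision] / grand_total
        chi_square += (observed[gender][decision] - expected) ** 2 / expected

# Degrees of freedom for an r x c table: (r - 1) * (c - 1).
dof = (len(observed) - 1) * (len(decisions) - 1)
print(f"chi-square = {chi_square:.3f}, degrees of freedom = {dof}")
```

With 2 degrees of freedom, the critical value at the 0.05 level is about 5.991, so a statistic this small for the made-up counts would not indicate an association.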
Code
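from scipy.stats import fisher_exact

# Fisher's Exact Test needs a 2x2 table, so keep only the Admit and Decline columns
# (cont_table, the gender-by-decision contingency table, is defined outside this snippet)
cont_table_small = cont_table[["Admit", "Decline"]]
odds_ratio, p_value_fish = fisher_exact(cont_table_small)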
Statistic - Fisher’s Exact Test    Value
Fisher’s Odds Ratio                1.0833
Fisher’s Test P-Value              1.0
For Fisher’s Exact Test, the odds ratio compares the odds of an event occurring in one group with the odds in another. An odds ratio greater than 1 indicates a positive association. The odds ratio of 1.083 means that the odds of being admitted are 8.3% higher for one group than for the other. Although the odds ratio differs slightly from 1, the p-value for the test is 1.0, which is above any conventional significance level, so the difference is not statistically significant. This agrees with the result of the Chi-Square test and indicates no evidence of gender bias in admissions.
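The odds ratio arithmetic can be sketched as follows. The 2x2 counts are hypothetical, chosen only so that the cross-product ratio lands near the reported value; they are not the report's actual table, and `sample_odds_ratio` is a helper name introduced for this example.

```python
def sample_odds_ratio(table):
    """Cross-product ratio (a*d)/(b*c) for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)


# Hypothetical counts: rows are the two gender groups, columns are Admit and Decline.
hypothetical_table = [[13, 12], [12, 12]]
ratio = sample_odds_ratio(hypothetical_table)

# Percent by which the first group's odds of admission exceed the second group's.
percent_higher = (ratio - 1) * 100
print(f"odds ratio = {ratio:.4f}, odds are {percent_higher:.1f}% higher")
```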
Academic & Holistic Insights
Through the use of pairplots and machine learning techniques, the relationships between students’ academic features (GPA, writing score, test score) and internship application outcomes can be understood. This information can help students at Data Tech understand which features of their application may contribute to internship decisions, and how strongly. Furthermore, this analysis can provide insight into areas of the curriculum that may or may not need targeted attention to increase students’ chances of internship admission.
Above is a pairplot of GPA, writing score, and test score of the students grouped by the decision. When looking at the scatterplots, we notice some patterns:
Students with low test score, no matter the GPA, were declined.
Students with high test score and high GPA were accepted.
Students with a pretty high GPA but average test score were waitlisted.
Students with high test score, no matter the writing score, were admitted.
Students with a low test score, no matter the writing score, were declined.
Students with high writing scores but average test score were waitlisted.
The pairplot makes it apparent that some of the academic features are related to the decision outcome, but some features seem to be more important than others.
To understand the factors influencing college student internship decisions, we employed a tree-based machine learning model, specifically XGBoost, to analyze the data. To interpret the model’s predictions and assess the impact of each factor, we utilized Shapley values, a concept from cooperative game theory. This analysis enables us to identify which factors most strongly influence internship decisions. Gaining insights into these relationships will help us pinpoint areas for improvement in the college curriculum, ensuring that students are well-prepared and have the highest likelihood of securing summer internships.
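To make the idea of Shapley values concrete, the sketch below computes them exactly for a toy linear scoring function by averaging each feature's marginal contribution over every feature ordering. This is not the XGBoost model or the SHAP library used in the analysis; the weights, the student's values, and the baseline are all hypothetical.

```python
from itertools import permutations

# Hypothetical linear scoring model; weights and values are illustrative only.
weights = {"TestScore": 0.6, "GPA": 0.3, "WritingScore": 0.1}
student = {"TestScore": 0.9, "GPA": 0.5, "WritingScore": 0.7}
baseline = {"TestScore": 0.5, "GPA": 0.5, "WritingScore": 0.5}


def model(features):
    """Toy stand-in for a trained model: a weighted sum of the inputs."""
    return sum(weights[name] * value for name, value in features.items())


def coalition_value(known):
    """Model output when only the features in `known` take the student's values."""
    mixed = {
        name: (student[name] if name in known else baseline[name])
        for name in weights
    }
    return model(mixed)


def shapley_values():
    """Average each feature's marginal contribution over all feature orderings."""
    names = list(weights)
    orderings = list(permutations(names))
    totals = {name: 0.0 for name in names}
    for ordering in orderings:
        known = set()
        for name in ordering:
            before = coalition_value(known)
            known.add(name)
            totals[name] += coalition_value(known) - before
    return {name: total / len(orderings) for name, total in totals.items()}


contributions = shapley_values()
print(contributions)
```

The contributions add up to the gap between the student's score and the baseline score, which is the property that makes Shapley values useful for attributing a prediction to individual features.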
The figure above quantifies the overall influence of students’ academic and holistic attributes on internship application outcomes. Higher SHAP values mean those features have a greater impact on internship decisions. Conversely, lower SHAP values indicate factors that are less important in the internship decision-making process.
As we can see, test scores, GPA, and writing scores are among the top contributors, while features such as work experience and volunteer level are not weighed as heavily. This highlights that companies are looking to the student’s academic background as a main focus for their decision compared to their holistic attributes.
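One common way to turn per-student SHAP values into an overall ranking like this is to average their absolute values per feature. The sketch below assumes that aggregation (the report does not state how the figure was computed) and uses made-up SHAP values rather than the model's output.

```python
# Hypothetical per-student SHAP values (one dict per student); not the model's output.
shap_rows = [
    {"TestScore": 0.30, "GPA": -0.10, "WorkExp": 0.02},
    {"TestScore": -0.25, "GPA": 0.15, "WorkExp": -0.01},
    {"TestScore": 0.35, "GPA": 0.05, "WorkExp": 0.03},
]

features = list(shap_rows[0])

# Global importance: average magnitude of each feature's contribution, ignoring sign.
importance = {
    name: sum(abs(row[name]) for row in shap_rows) / len(shap_rows)
    for name in features
}

# Rank features from most to least influential.
ranking = sorted(importance, key=importance.get, reverse=True)
print(ranking)
```

Taking absolute values matters because a feature that pushes some students strongly toward admission and others strongly away would otherwise average out to roughly zero.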
While the previous plot displayed how student attributes play a role in the internship decision-making process overall, Shapley statistics allow us to go a step further by analyzing how each feature contributes to each possible decision (Admit, Waitlist, Decline).
The figure above displays just that. It appears that the test score is the most significant factor contributing to a student’s likelihood of being admitted. It falls in line with the earlier pairplot as we saw significant overlap in internship decisions among GPAs and writing scores, but distinct separation for test scores. This suggests that students with higher test scores have a greater advantage in the competitive internship landscape.
For students who are placed on a waitlist, both test scores and GPA are important considerations. This might indicate that students on the waitlist have comparatively lower test scores than those who are admitted outright. The pairplot shows exactly that, so it seems that for these students academic performance is a deciding factor that could tip the balance in their favor for internship admission.
On the other hand, for students who are declined, the test score still holds considerable weight, but the writing score becomes notably more influential. This pattern could imply that declined students, while possibly having adequate test scores, may fall short in demonstrating the necessary writing proficiency, which is critical for many internships that require strong communication skills.
Conclusions
These insights suggest several strategic focuses for the institution:
Test Score Improvement: Continue to prioritize and enhance test preparation services, ensuring that the students can achieve the highest scores possible.
Academic Support: Given the importance of GPA, particularly for waitlisted students, bolstering academic support can help these students improve their standing and increase their chances of moving from waitlist to admit.
Writing Proficiency: To address the writing skills that affect both waitlisted and declined students, consider expanding the writing centers and integrating more communication-focused workshops into student services.
Industry Research/Course Creation: Research which industries students are applying to for internships to gain better insight into their interests. Courses more closely tailored to the needs of Data Tech College students should then be created to increase admission rates at internships across the country.
By concentrating on these areas, the college will help its students to not only meet but exceed the expectations of internship programs, thereby improving their chances of being admitted.